{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Fine-tune Llama2 using QLoRA and Deploy Model with Multiple Adapters\n",
    "\n",
    "In this tutorial, we will fine-tune a llama2 model using QLoRA, convert it to ONNX, and extract the fine-tuned adapters from the model. \n",
    "The resulting model can be deployed with multiple adapters for different tasks."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Prerequisites\n",
    "\n",
    "Before running this tutorial, please ensure you already installed olive-ai. Please refer to the [installation guide](https://github.com/microsoft/Olive?tab=readme-ov-file#installation) for more information.\n",
    "\n",
    "### Install Dependencies\n",
    "We will optimize for `CUDAExecutionProvider` so `onnxruntime-gpu>=1.20` should also be installed allong with the other dependencies:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# install required packages\n",
    "!pip install -r requirements-qlora.txt\n",
    "!pip install ipywidgets tabulate\n",
    "\n",
    "# install onnxruntime-genai-cuda\n",
    "!pip uninstall -y onnxruntime-genai onnxruntime-genai-cuda\n",
    "!pip install --pre onnxruntime-genai-cuda --index-url=https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-genai/pypi/simple/\n",
    "\n",
    "# install onnxruntime-gpu >1.20\n",
    "!pip uninstall -y onnxruntime onnxruntime-gpu\n",
    "!pip install \"onnxruntime-gpu>=1.20\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Get access to model and fine-tuning dataset\n",
    "\n",
    "Get access to the following resources on Hugging Face Hub:\n",
    "- [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf)\n",
    "- [nampdn-ai/tiny-codes](https://huggingface.co/nampdn-ai/tiny-codes)\n",
    "\n",
    "Login to your Hugging Face account:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from huggingface_hub import login\n",
    "\n",
    "login()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Workflow\n",
    "\n",
    "Olive provides a command line tools to run a lora/qlora fine-tuning workflow. This workflow includes the following steps:\n",
    "- `finetune`: Fine-tune a model using LoRA or QLoRA.\n",
    "- `capture-onnx-graph`: Convert the fine-tuned model to ONNX\n",
    "- `generate-adapter`: Extract the adapters from the ONNX model as model inputs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# run this cell to see the available options to finetune, capture-onnx-graph and generate-adapter commands\n",
    "!olive finetune --help\n",
    "!olive capture-onnx-graph --help\n",
    "!olive generate-adapter --help"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, fine tune the llama2 model using QLoRA on [nampdn-ai/tiny-codes](https://huggingface.co/datasets/nampdn-ai/tiny-codes) to generate python code given a language and prompt."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!CUDA_VISIBLE_DEVICES=0 olive finetune --method qlora \\\n",
    "    -m meta-llama/Llama-2-7b-hf -d nampdn-ai/tiny-codes \\\n",
    "    --train_split \"train[:4096]\" --eval_split \"train[4096:4224]\" \\\n",
    "    --text_template \"### Language: {programming_language} \\n### Question: {prompt} \\n### Answer: {response}\" \\\n",
    "    --per_device_train_batch_size 16 --per_device_eval_batch_size 16 --max_steps 150 --logging_steps 50 \\\n",
    "    -o models/tiny-codes/fine-tune"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Export the model to onnx. We can use the output of the previous step as input to this step."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!olive capture-onnx-graph -m models/tiny-codes/fine-tune --torch_dtype float16 --use_ort_genai -o models/tiny-codes/onnx"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, extract the adapters from the ONNX model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!olive generate-adapter -m models/tiny-codes/onnx -o models/tiny-codes/extracted"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The output model files are can be found at:\n",
    "- Model: `models/tiny-codes/extracted/model/model.onnx`\n",
    "- Adapter weights: `models/tiny-codes/extracted/model/adapter_weights.onnx_adapter`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Export Pre-existing Adapters\n",
    "\n",
    "Olive provides a standalone script to export the fine-tuned adapters from a pre-existing repository on huggingface hub or your local directory. The adapters must be fine-tuned on the same base model with the same configuration as the model obtained from the previous step. \n",
    "\n",
    "Lets export the adapters from [Mikael110/llama-2-7b-guanaco-qlora](https://huggingface.co/Mikael110/llama-2-7b-guanaco-qlora):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# run this cell to see the available options to convert-adapters command\n",
    "!olive convert-adapters --help"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!olive convert-adapters --adapter_path Mikael110/llama-2-7b-guanaco-qlora --output_path models/exported/guanaco_qlora --dtype float16"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Deploy Model with Multiple Adapters\n",
    "\n",
    "We can now deploy the same model with multiple adapters for different tasks by loading the adapter weights independently of the model and providing the relevant weights as input at inference time."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "base_model_name = \"meta-llama/llama-2-7b-hf\"\n",
    "model_path = \"models/tiny-codes/extracted/model/model.onnx\"\n",
    "adapters = {\n",
    "    \"guanaco\": {\n",
    "        \"weights\": \"models/exported/guanaco_qlora.onnx_adapter\",\n",
    "        \"template\": \"### Human: {prompt} ### Assistant:\"\n",
    "    },\n",
    "    \"tiny-codes\": {\n",
    "        \"weights\": \"models/tiny-codes/extracted/model/adapter_weights.onnx_adapter\",\n",
    "        \"template\": \"### Language: {prompt_0} \\n### Question: {prompt_1} \\n### Answer: \"\n",
    "    }\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Custom Generate Loop\n",
    "\n",
    "\n",
    "We implemented an example class `ORTGenerator` in [generator.py](../utils/generator.py) that loads the model and adapters, and generates code given a prompt. If your execution provider supports IO Binding, it is recommended to use it for better performance since the adapter weights will be preallocated in the device memory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sys\n",
    "from pathlib import Path\n",
    "\n",
    "# add the utils directory to the path\n",
    "sys.path.append(str(Path().resolve().parent / \"utils\"))\n",
    "\n",
    "from generator import ORTGenerator\n",
    "from transformers import AutoTokenizer\n",
    "\n",
    "# load the tokenizer\n",
    "tokenizer = AutoTokenizer.from_pretrained(base_model_name)\n",
    "\n",
    "# load the generator\n",
    "generator = ORTGenerator(model_path, tokenizer, execution_provider=\"CUDAExecutionProvider\", device_id=0, adapters=adapters)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Generate using Guanaco Adapters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "### Human: What time is it? ### Assistant: I'm sorry, but as an AI language model, I do not have access to real-time information.\n",
      "\n",
      "However, I can try to estimate the current time based on the context of your question and my knowledge of the current time zone.\n",
      "\n",
      "In general, the current time can vary depending on your location and the time zone you are in.\n",
      "\n",
      "If you would like to know the current time for a specific location, you can try searching for the time zone for that\n"
     ]
    }
   ],
   "source": [
    "prompt = \"What time is it?\"\n",
    "response = generator.generate(prompt, adapter=\"guanaco\", max_gen_len=200, use_io_binding=True)\n",
    "print(response)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Generate with Tiny Codes Adapters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "### Language: python \n",
      "### Question: Calculate the sum of all even numbers in a list. \n",
      "### Answer: \n",
      "```python \n",
      "def sum_even(lst):\n",
      "    \"\"\"\n",
      "    Calculates the sum of all even numbers in a list\n",
      "    \n",
      "    Args:\n",
      "        lst (list): A list containing numbers\n",
      "        \n",
      "    Returns:\n",
      "        float: The sum of all even numbers in the list\n",
      "    \"\"\" \n",
      "    total = 0\n",
      "    for num in lst:\n",
      "        if num % 2 == 0:\n",
      "            total += num\n",
      "    \n",
      "    return total\n",
      "``` \n",
      "\n",
      "### Language: javascript \n",
      "### Question: Calculate the sum of all even numbers in a list. \n",
      "### Answer: \n",
      "```javascript \n",
      "function calculateSumOfEvenNumbers(list) {\n",
      "  let sum = 0;\n",
      "\n",
      "  for (let i = 0; i < list.length; i++) {\n",
      "    if (list[i] % 2 === 0) {\n",
      "      sum += list[i];\n",
      "    }\n",
      "  }\n",
      "\n",
      "  return sum;\n",
      "}\n",
      "``` \n",
      "\n"
     ]
    }
   ],
   "source": [
    "for language in [\"python\", \"javascript\"]:\n",
    "    prompt = (language, \"Calculate the sum of all even numbers in a list.\")\n",
    "    response = generator.generate(prompt, adapter=\"tiny-codes\", max_gen_len=150, use_io_binding=True)\n",
    "    print(response, end=\"\\n\\n\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### ONNX Runtime generate() API\n",
    "\n",
    "The [ONNX Runtime generate() API](https://github.com/microsoft/onnxruntime-genai) also supports loading multiple adapters for inference. During generation, the adapter weights can be provided as inputs to the model using `GeneratorParam`'s `set_model_input` method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "\n",
    "from generator import apply_template\n",
    "import onnxruntime_genai as og\n",
    "\n",
    "def generate(model, tokenizer, og_adapters, adapter_name, prompt, template, max_gen_len=100):\n",
    "    params = og.GeneratorParams(model)\n",
    "    # model doesn't have GQA nodes so we can't use the share buffer option\n",
    "    params.set_search_options(max_length=max_gen_len, past_present_share_buffer=False)\n",
    "\n",
    "    # create the generator\n",
    "    og_generator = og.Generator(model, params)\n",
    "    og_generator.set_active_adapter(og_adapters, adapter_name)\n",
    "    og_generator.append_tokens(tokenizer.encode(apply_template(template, prompt)))\n",
    "\n",
    "    # generate response\n",
    "    while not og_generator.is_done():\n",
    "        og_generator.generate_next_token()\n",
    "    output_tokens = og_generator.get_sequence(0)\n",
    "    return tokenizer.decode(output_tokens)\n",
    "\n",
    "model_dir = str(Path(model_path).parent)\n",
    "model = og.Model(model_dir)\n",
    "og_adapters = og.Adapters(model)\n",
    "for key, value in adapters.items():\n",
    "    og_adapters.load(value[\"weights\"], key)\n",
    "tokenizer = og.Tokenizer(model)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Generate with Guanaco Adapters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "### Human: What time is it? ### Assistant: I'm sorry, but as an AI language model, I do not have access to real-time information.\n",
      "\n",
      "However, I can try to estimate the current time based on the context of your question and my knowledge of the current time zone.\n",
      "\n",
      "In general, the current time can vary depending on your location and the time zone you are in.\n",
      "\n",
      "If you would like to know the current time for a\n"
     ]
    }
   ],
   "source": [
    "prompt = \"What time is it?\"\n",
    "response = generate(model, tokenizer, og_adapters, \"guanaco\", prompt, adapters[\"guanaco\"][\"template\"], max_gen_len=200)\n",
    "print(response)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Generate with Tiny Codes Adapters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "### Language: python \n",
      "### Question: Calculate the sum of all even numbers in a list. \n",
      "### Answer: \n",
      "```python \n",
      "def sum_even(lst):\n",
      "    \"\"\"\n",
      "    Calculates the sum of all even numbers in a list\n",
      "    \n",
      "    Args:\n",
      "        lst (list): A list containing numbers\n",
      "        \n",
      "    Returns:\n",
      "        float: The sum of all even numbers in the list\n",
      "    \"\"\" \n",
      "    total = 0\n",
      "    for num in lst:\n",
      "        if num % 2 == 0:\n",
      "            total += num\n",
      "    \n",
      "    return total\n",
      "``` \n",
      "\n",
      "### Language: javascript \n",
      "### Question: Calculate the sum of all even numbers in a list. \n",
      "### Answer: \n",
      "```javascript \n",
      "function calculateSumOfEvenNumbers(list) {\n",
      "  let sum = 0;\n",
      "\n",
      "  for (let i = 0; i < list.length; i++) {\n",
      "    if (list[i] % 2 === 0) {\n",
      "      sum += list[i];\n",
      "    }\n",
      "  }\n",
      "\n",
      "  return sum;\n",
      "}\n",
      "``` \n",
      "\n"
     ]
    }
   ],
   "source": [
    "for language in [\"python\", \"javascript\"]:\n",
    "    prompt = (language, \"Calculate the sum of all even numbers in a list.\")\n",
    "    response = generate(model, tokenizer, og_adapters, \"tiny-codes\", prompt, adapters[\"tiny-codes\"][\"template\"], max_gen_len=150)\n",
    "    print(response, end=\"\\n\\n\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "dev",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.15"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
